ABSTRACT

A hidden database refers to a dataset that an organization makes 
accessible on the web by allowing users to issue queries through 
a search interface. In other words, data acquisition from such a 
source is not by following static hyper-links. Instead, data are
obtained by querying the interface and reading the dynamically
generated result page. This, together with other facts such as that the
interface may answer a query only partially, has prevented hidden
databases from being crawled effectively by existing search engines.

This paper remedies the problem by giving algorithms to extract 
all the tuples from a hidden database. Our algorithms are provably 
efficient, namely, they accomplish the task by performing only a 
small number of queries, even in the worst case. We also establish 
theoretical results indicating that these algorithms are asymptoti- 
cally optimal - i.e., it is impossible to improve their efficiency by 
more than a constant factor. The derivation of our upper and lower 
bound results reveals significant insight into the characteristics of 
the underlying problem. Extensive experiments confirm the pro- 
posed techniques work very well on all the real datasets examined. 

1. INTRODUCTION 

It is known that existing search engines can reach only a 
small portion of the Internet. They crawl HTML pages interconnected
with hyper-links, which constitute the so-called surface
web. Nowadays, an increasing number of organizations (e.g., companies,
governments, institutions, etc.) bring their data online, by
allowing a public user to query their back-end databases through 
context-dependent web interfaces. More specifically, data acqui- 
sition is performed by interacting with the interface at runtime, 
as opposed to following hyper-links. As a result, those back-end
databases cannot be effectively crawled by a search engine under
the current technology, and therefore, are usually referred to as
hidden databases.

Figure 1: Form-based querying of a hidden database

Consider, for example, Yahoo! Autos (autos.yahoo.com), a 
popular website for online trading of automobiles. A potential 
buyer specifies her/his filtering criteria through a form as illus- 
trated in Figure 1. The query is submitted to the system, which 
runs it against the back-end database, and returns the result to the 
user. What makes it non-trivial (for a search engine) to crawl the 
database is that, setting all search criteria to ANY does not accom- 
plish the task. The reason is that a system typically limits the num- 
ber k of tuples returned (k = 1000 for Yahoo! Autos, at the time 
this paper was written), and that repeating the same query may not 
retrieve new tuples, i.e., the same k tuples may always be returned. 

The ability to crawl a hidden database comes with the appealing
promise of enabling virtually any form of processing on
the database's content. The challenge, however, is clear: how to
obtain all the tuples, given that the system limits the number of
tuples returned for each query? A naive solution is to issue a query for
every single location in the data space (e.g., in Figure 1, the data
space is the Cartesian product¹ of the domains of MAKE, BODY
STYLE, PRICE, and MILEAGE), but the number of queries needed 
can obviously be prohibitive. This gives rise to an interesting prob- 
lem, as we define in the next subsection, where the objective is to 
minimize the number of queries. 

1.1 Problem Definitions 

We consider that the data space D has d attributes A1, ..., Ad,
each of which has a discrete domain. Specifically, denote by
dom(Ai) the domain of Ai for each i ∈ [1, d]; then, D is the Cartesian
product of dom(A1), ..., dom(Ad). We refer to each element
of the Cartesian product as a point in D, i.e., a point is a possible
combination of values of all dimensions.

¹ While one may leverage knowledge of attribute dependencies -
e.g., BMW does not sell trucks in the US - to prune the data space
into a subset of the Cartesian product, the subset is often still too
large to enumerate.

Depending on whether there is a total ordering on dom(Ai) or 
not, we call Ai a numeric or categorical attribute, respectively. Our 
discussion distinguishes three types of D: 

• Numeric: all the d attributes of D are numeric. 

• Categorical: all the d attributes are categorical. In this case, 
we use Ui to represent the size of dom(Ai), i.e., how many 
distinct values there are in dom(Ai). 

• Mixed: the first cat ∈ [1, d - 1] attributes A1, ..., Acat are
categorical, whereas the other d - cat attributes are numeric.
Similar to before, let Ui = |dom(Ai)| for each i ∈ [1, cat].

To facilitate presentation, we consider the domain of a numeric
Ai to be the set of all integers, and that of a categorical Ai to
be the set of integers from 1 to Ui. Keep in mind, however, that the
ordering of these values is irrelevant to a categorical Ai.

Let D be the hidden database of a server, with each element of
D being a point in the data space. To avoid ambiguity, we will always
refer to elements of the database D as tuples. D is a bag (i.e., a
multi-set), that is, it may contain identical tuples.

The server supports queries on D. As shown in Figure 1, each 
query specifies a predicate on each attribute. Specifically, if Ai is 
numeric, the predicate is a range condition in the form of 

Ai ∈ [x, y]

where [x,y] is an interval in dom(Ai). On the other hand, for a 
categorical Ai, the predicate is: 

Ai = x 

where x is either a value in dom(Ai) or a wildcard *. In particular,
a predicate Ai = * means that Ai can be an arbitrary value in
dom(Ai), i.e., capturing BODY STYLE = ANY in Figure 1. Note
that if a hidden database server only allows single-value predicates
(i.e., no range-condition support) on a numeric attribute, then we 
can simply consider the attribute as categorical. 

Given a query q, denote by q(D) the bag of tuples in D qualify- 
ing all the predicates of q. The server does not necessarily return 
the entire q(D) - it does so only when q(D) is small. Formally, the 
response of the server is: 

• if |q(D)| ≤ k: the entire q(D) is returned. In this case, we
say that q is resolved.

• Otherwise: only k tuples² in q(D) are returned, together with
a signal indicating that q(D) still has other tuples. In this
case, we say that q overflows.

The value of k is a system parameter (e.g., k = 1000 for Yahoo!
Autos, as mentioned earlier). It is important to note that, in case a 
query q overflows, repeatedly issuing the same q may always get 
the same response from the server, and does not help to obtain the 
other tuples in q(D). 
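
The server behavior described above can be sketched as follows. This is only a toy model under stated assumptions: the class name HiddenServer, the encoding of a predicate (a (x, y) pair for a numeric range, a single value for a categorical equality, None for the wildcard *) and the choice of always returning the first k qualifying tuples in stored order are all hypothetical, and merely stand in for whatever ranking a real server applies.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


class HiddenServer:
    """Toy top-k server: each query returns at most k qualifying tuples."""

    def __init__(self, tuples: Sequence[Point], k: int) -> None:
        self.tuples = list(tuples)  # a bag, so duplicates are allowed
        self.k = k
        self.cost = 0               # number of queries answered so far

    @staticmethod
    def _qualifies(t: Point, query: Sequence) -> bool:
        for value, pred in zip(t, query):
            if pred is None:              # wildcard *: any value qualifies
                continue
            if isinstance(pred, tuple):   # numeric attribute: range [x, y]
                x, y = pred
                if not x <= value <= y:
                    return False
            elif value != pred:           # categorical attribute: Ai = x
                return False
        return True

    def query(self, query: Sequence) -> Tuple[List[Point], bool]:
        """Return (tuples, overflow_flag) for one query."""
        self.cost += 1
        hits = [t for t in self.tuples if self._qualifies(t, query)]
        if len(hits) <= self.k:
            return hits, False            # resolved: the entire q(D)
        return hits[: self.k], True       # overflow: always the same k tuples


server = HiddenServer([(1, 10), (1, 20), (2, 30), (2, 40), (1, 50)], k=2)
print(server.query([1, (0, 100)]))   # three tuples qualify, so q overflows
print(server.query([2, (0, 100)]))   # two tuples qualify, so q is resolved
```

Because the overflow branch is deterministic, issuing the same overflowing query twice yields the same k tuples, which is exactly why repetition does not help a crawler.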

The problem addressed by this paper is: 

Problem 1. (Hidden Database Crawling) Retrieve the 
entire D while minimizing the number of queries. 

Recall that D is a bag, i.e., it may have duplicate tuples. We
require that no point in the data space have more than k tuples in
D. Otherwise, Problem 1 has no solution at all. To see this,
consider the existence of k + 1 tuples t_1, ..., t_{k+1} in D, all of which are
equivalent to a point p in the data space. Then, whenever p satisfies a query, the
server can always choose to leave t_{k+1} out of its response, making
it impossible for any algorithm to extract the entire D. Note that, in
Yahoo! Autos, the previous requirement essentially states that there
cannot be more than k = 1000 vehicles in the database having exactly the
same values on all attributes - an assumption that is fairly realistic.

² In practice, these are usually the k tuples that have the highest
priorities (e.g., according to a ranking function) among all the tuples
qualifying the query.

As mentioned in Problem 1, the cost of an algorithm is the number
of queries issued. This metric is motivated by the fact that most
systems limit how many queries can be submitted by
the same IP address within a period of time (e.g., a day). Therefore,
a crawler must minimize the number of queries to get the task done,
which also keeps the burden on the server as low as possible.

We will use n to denote the number of tuples in D. It is clear that 
the number of queries needed to extract the entire D is at least n/k. 
Of course, this ideal cost may not always be possible. Hence, the 
central (technical) questions to be answered are two-fold. First, on 
the upper bound side, how to solve Problem 1 by performing only 
a small number of queries even in the worst case? Second, on the 
lower bound side, how many queries are compulsory for solving 
the problem in the worst case? 

1.2 Our Results 

This paper presents a systematic study of hidden database crawling
as defined in Problem 1. At a high level, our first contribution is
a set of algorithms that are both provably fast in the worst case, and 
efficient on practical data. Our second contribution is a set of lower- 
bound results establishing the hardness of the problem. These re- 
sults make explicit how the hardness is affected by the underlying 
factors, and thus reveal valuable insights into the characteristics of 
the problem. Furthermore, the lower bounds also prove that our 
algorithms are already optimal asymptotically, i.e., they cannot be 
improved by more than a constant factor. 

Our first main result is: 

THEOREM 1. There is an algorithm for solving Problem 1 
whose cost is: 

• $O(d \cdot n/k)$ when D is numeric;

• at most $U_1$ when D is categorical and cat = 1 (i.e., there is
only one categorical attribute);

• at most $\frac{n}{k} \cdot \sum_{i=1}^{d} \min\{U_i, \frac{n}{k}\} + \sum_{i=1}^{d} U_i$ when D is
categorical and cat > 1;

• at most $U_1 + O(d \cdot n/k)$ when D is mixed and cat = 1;

• otherwise (i.e., D is mixed and cat > 1): at most

$$\frac{n}{k} \cdot \sum_{i=1}^{cat} \min\{U_i, \tfrac{n}{k}\} + \sum_{i=1}^{cat} U_i + O((d - cat) \cdot n/k).$$

The above can be conveniently understood as follows: our algorithm
pays an (additive) cost of O(n/k) for each numeric attribute
Ai, whereas it pays (n/k) · min{Ui, n/k} + Ui for each categorical Ai.
The only exception is when cat = 1: in this scenario, we pay
merely U1 for the (only) categorical attribute A1. Notice that the
cost on each numeric attribute is irrelevant to its domain size.
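
To get a feeling for the magnitude of the categorical terms, the bound of Theorem 1 for a purely categorical D can be evaluated numerically. The sketch below is illustrative only: the function name categorical_bound and the sample values of n, k and the domain sizes are hypothetical, and the O(n/k) terms for numeric attributes are left out because their constants are hidden by the asymptotic notation.

```python
from typing import Sequence


def categorical_bound(n: int, k: int, domain_sizes: Sequence[int]) -> float:
    """Upper bound of Theorem 1 when all attributes are categorical."""
    if len(domain_sizes) == 1:
        # cat = 1: at most U1 queries
        return float(domain_sizes[0])
    ratio = n / k
    # each categorical Ai pays (n/k) * min{Ui, n/k} + Ui
    multiplicative = ratio * sum(min(u, ratio) for u in domain_sizes)
    additive = sum(domain_sizes)
    return multiplicative + additive


# Hypothetical instance: n = 100,000 tuples, k = 1,000, so n/k = 100.
print(categorical_bound(100_000, 1_000, [50]))       # one categorical attribute
print(categorical_bound(100_000, 1_000, [50, 10]))   # two categorical attributes
```

With these numbers, the single-attribute bound is 50 queries, while adding a second attribute of domain size 10 raises the bound to 100 · (50 + 10) + 60 = 6060, which illustrates the leap caused by the multiplicative term.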

Our second main result complements the preceding one: 

THEOREM 2. None of the results in Theorem 1 can be improved 
by more than a constant factor in the worst case. 

Besides establishing the optimality of our upper bounds in Theorem 1,
Theorem 2 has its own interesting implications. First, it indicates
the unfortunate fact that, for all types of D, the best achievable
query time (in the worst case) is much higher than the ideal
cost of n/k (nevertheless, Theorem 1 suggests that we can achieve
this cost asymptotically when d is a constant and all attributes are
numeric). Second, as the number cat of categorical attributes
increases from 1 to 2, the discrepancy of the time complexities in
Theorem 1 is not an artifact, but rather, it is due to an inherent
leap in the hardness of the problem (this is true regardless of the
number of numeric attributes). That is, while we pay only O(U1)
extra queries for the (sole) categorical attribute when cat = 1, as
cat grows to 2 and onwards, the cost paid (by any algorithm) for
each categorical Ai has an extra term of (n/k) · min{Ui, n/k}. Given
that the term is multiplicative, this finding implies (perhaps
surprisingly) that, in the worst case, it may be infeasible to crawl a hidden
database with a large size n, and at least 2 categorical attributes
such that at least one of them has a large domain.

We have performed extensive experiments to evaluate the effi- 
ciency of the proposed algorithms on real datasets, and demonstrate 
that they demand significantly fewer queries than alternative solu- 
tions. Our experimentation also reveals that the number of queries 
needed to crawl a hidden database may be far less than previously 
thought; for example, for k = 1000, around 200 queries already 
suffice for crawling a dataset containing 69,768 tuples from the hid- 
den database at Yahoo! Autos. This phenomenon suggests that, for 
a search engine, crawling a hidden database may no longer be a 
goal of tomorrow, whereas for a data provider, permitting an en- 
gine to crawl its database is not expected to impose a heavy toll on 
its workload. 

1.3 Practical Remarks 

Domain values. In our problem definition, the crawler should 
know the domains of the categorical attributes (note that this is- 
sue is irrelevant to numerical attributes, whose domains can always 
be considered to be (-∞, ∞)). For some websites, the domains
of all attributes are explicitly provided such that our algorithms can 
be applied immediately. For example, this is the case for Yahoo! 
Autos, where all the values of, say, MAKE can be seen from the 
pull-down menu of its query interface. For other websites, before 
using our technique, the crawler needs to first discover the domain 
values of categorical attributes. Domain discovery has been stud- 
ied in [15], and can be accomplished with a number of effective 
algorithms. 

Attribute dependency. As mentioned in the above discussions, be- 
cause of attribute dependencies in a practical hidden database, not 
all the points in the data space D can contain a tuple. For example, 
with proper external knowledge of the dependency between MAKE 
and BODY STYLE, one does not need to explore points with MAKE 
= BMW and BODY STYLE = TRUCK. While knowledge of at- 
tribute dependencies is not considered in this paper, please note that 
our upper bound results still hold even in scenarios where attribute 
dependencies exist. In other words, even if our algorithm issues
some unnecessary queries, the number of queries is
still limited by the claimed bounds, which is exactly
the merit of Theorem 1. In practice, there is an obvious heuristic
for adapting our algorithm to account for attribute dependencies:
the crawler issues a query demanded by our algorithm only if the
query covers at least one valid point in D (according to the crawler's
dependency knowledge). The query cost can only go down, i.e., it is
still guaranteed to be within our upper bounds.

Acquisition of knowledge about attribute dependencies requires 
dedicated efforts to analyze the hidden database at a particular web- 
site - efforts that obviously cannot be afforded by the crawler for all 
websites. The implication is thus that the crawler may not be aware 
of the latent attribute dependencies at many of the websites being 
crawled. This is where our lower bound results come into play:
they indicate the ultimate worst-case efficiency that the crawler can
possibly achieve in these environments, a piece of information vital
for the crawler's design.

1.4 Previous Work 

A significant body of research has been carried out on how to ex- 
tract, integrate, and analyze data from the deep web, a general term 
referring to the entire collection of online information unreachable 
by search engines (including, but not limited to, hidden databases). 
As explained below, however, the issues that have been addressed 
by the existing work are all orthogonal to this paper. 

Most relevant to our work are the previous studies on crawling
hidden text-based [1, 5, 18, 20] and structured [2, 7, 16, 17, 19]
databases. The focus of those studies is how to formulate queries
to retrieve meaningful results. More specifically, the primary challenge
in [1, 5, 18, 20], where the query interface is a keyword-based
(Google-like) form, is to discover legitimate query keywords. On 
the other hand, the main objective in [2, 7, 16, 17, 19], where the 
query interface is an HTML-form like the one in Figure 1, is to ex- 
pose combinations of input values suitable for filling in the form. In 
this paper, we are not concerned with mining effective queries, but 
instead, attack directly how to acquire a complete hidden database 
with the smallest number of queries (see Problem 1). 

Also relevant is the literature of data analytics in deep web. In 
this vein, a main stream is to investigate how sampling can be 
deployed to perform, for example, content summary generation 
[8, 14], top-k retrieval [6], aggregate estimation [9], measurement
of various metrics of search engines [3,4], and so on. The crawling 
techniques proposed in this paper aim at enabling a much broader 
class of applications (e.g., virtually any query on the database, as 
described earlier), which otherwise would not be possible if only a 
sample of the hidden database could be obtained. 

It is worth mentioning that there has been considerable research
on other problems related to the deep web, which, however, are
only remotely related to our work. While a complete survey is out
of the scope of this paper, entry points for further reading can be
found in [10, 21] on parsing and understanding web interfaces, in
[13] on attribute mapping across different interfaces, and in [11, 12]
on integrating the query interfaces of multiple hidden databases.

2. NUMERICAL ATTRIBUTES 

This section will explain how to solve Problem 1 when the data 
space D is numeric. In Section 2.1, we first define some atomic 
operators, and present an algorithm that is intuitive, but has no at- 
tractive performance bounds. Then, in Sections 2.2 and 2.3, we 
present another algorithm to achieve the optimal performance. 

2.1 Basic Operations and Baseline Algorithm 

Recall that, in a numeric D, the predicate of a query q on each 
attribute is a range condition. Thus, q can be regarded as a d- 
dimensional (axis-parallel) rectangle, such that its result q(D) con- 
sists of the tuples of D covered by that rectangle. If the predicate of 
q on attribute Ai (i ∈ [1, d]) is Ai ∈ [x1, x2], we say that [x1, x2]
is the extent of the rectangle of q along Ai. Henceforth, we may 
use symbol q to refer to its rectangle also, when no ambiguity can 
be caused. Clearly, settling Problem 1 is equivalent to determining 
the entire q(D) where q is the rectangle covering the whole D. 

Split. A fundamental idea to extract all the tuples in q(D) is to 
refine q into a set S of smaller rectangles, such that each rectangle 
q' ∈ S can be resolved (i.e., q'(D) has at most k tuples). Note that
this always happens as long as rectangle q' is sufficiently small -
in the extreme case, when q' has degenerated into a point in D, the
query q' is definitely resolved (otherwise, there would be at least
k + 1 tuples of D at this point). Therefore, a basic operation in our
algorithms for Problem 1 is split, as described next.

Figure 2: Splitting

Given a rectangle q, we may perform two types of splitting, de- 
pending on how many rectangles q is divided into: 

• 2-way split: Let [x1, x2] be the extent of q on Ai (for some
i ∈ [1, d]). A 2-way split at a value x ∈ [x1, x2] partitions
q into rectangles q_left and q_right, by dividing the Ai-extent
of q at x. Formally, on any attribute other than Ai, q_left and
q_right have the same extents as q. Along Ai, however, the
extent of q_left is [x1, x - 1], whereas that of q_right is [x, x2].
Figure 2a illustrates the idea by splitting on the horizontal
attribute.

• 3-way split: Let [x1, x2] be defined as above. A 3-way split
at a value x ∈ [x1, x2] partitions q into rectangles q_left,
q_mid and q_right as follows. On any attribute other than Ai,
they have the same extent as q. Along Ai, however, the extent
of q_left is [x1, x - 1], that of q_mid is [x, x], and that of
q_right is [x + 1, x2]. See Figure 2b.

In the sequel, a 2-way split will be abbreviated simply as a split. 
No confusion can arise as long as we always mention 3-way in re- 
ferring to a 3-way split. The extent of a query q on an attribute Ai 
can become so short that it covers only a single value, in which case 
we say that Ai is exhausted on q. For instance, the horizontal
attribute is exhausted on q_mid in Figure 2b. It is easy to see that there
is always a non-exhausted attribute on q unless q has degenerated 
into a point. 
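
The two kinds of split and the notion of an exhausted attribute can be written down directly. The following is a minimal sketch, assuming a rectangle is represented as a tuple of closed integer extents (one per attribute); the names Rect, split_2way, split_3way and is_exhausted are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]      # closed extent [x1, x2] on one attribute
Rect = Tuple[Interval, ...]     # one extent per attribute


def _with_extent(q: Rect, i: int, extent: Interval) -> Rect:
    """Copy of q whose extent on attribute i is replaced."""
    return q[:i] + (extent,) + q[i + 1:]


def split_2way(q: Rect, i: int, x: int) -> Tuple[Rect, Rect]:
    """Return (q_left, q_right) with extents [x1, x-1] and [x, x2] on attribute i."""
    x1, x2 = q[i]
    return _with_extent(q, i, (x1, x - 1)), _with_extent(q, i, (x, x2))


def split_3way(q: Rect, i: int, x: int) -> Tuple[Rect, Rect, Rect]:
    """Return (q_left, q_mid, q_right) with extents [x1, x-1], [x, x], [x+1, x2]."""
    x1, x2 = q[i]
    return (_with_extent(q, i, (x1, x - 1)),
            _with_extent(q, i, (x, x)),
            _with_extent(q, i, (x + 1, x2)))


def is_exhausted(q: Rect, i: int) -> bool:
    """Attribute i is exhausted on q if its extent covers a single value."""
    return q[i][0] == q[i][1]


q = ((0, 100), (5, 9))
left, mid, right = split_3way(q, 0, 50)
print(left, mid, right, is_exhausted(mid, 0))
```

Note that the sketch follows the definitions literally, so a split at x = x1 produces a q_left whose extent [x1, x1 - 1] is empty; a caller would discard such a rectangle.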

Binary shrink. Next, we describe a straightforward algorithm for 
solving Problem 1, which will serve as the baseline approach for 
comparison. This algorithm, named binary-shrink, repeatedly per- 
forms (2-way) splits until a query is resolved. Specifically, given a 
rectangle q, binary-shrink runs the rectangle (by submitting its
corresponding query to the server) and finishes if q is resolved.
Otherwise, the algorithm splits q on an attribute Ai that has not been
exhausted, by cutting the extent [x1, x2] of q along Ai into equally
long intervals (i.e., the split is performed at x = ⌈(x1 + x2)/2⌉).
Let q_left, q_right be the queries produced by the split. The algorithm
then recurses on q_left and q_right, respectively.
Remark. It is obvious that the cost of binary-shrink (i.e., the num- 
ber of queries issued) depends on the domain sizes of the (numeric) 
attributes of D, which can be unbounded. In the following subsec- 
tions, we will improve this algorithm to optimality. 
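
A minimal sketch of binary-shrink is given below. It rests on several assumptions that are not part of the description above: every extent is a finite integer interval (so that the midpoint is defined), the first non-exhausted attribute is always chosen for splitting, and the server is simulated by a hypothetical TopKServer class that returns the first k qualifying tuples in stored order.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]
Rect = Tuple[Tuple[int, int], ...]   # closed extent [x1, x2] per attribute


class TopKServer:
    """Hypothetical top-k server over range queries; counts queries in .cost."""

    def __init__(self, data: Sequence[Point], k: int) -> None:
        self.data, self.k, self.cost = list(data), k, 0

    def query(self, q: Rect) -> Tuple[List[Point], bool]:
        self.cost += 1
        hits = [t for t in self.data
                if all(lo <= v <= hi for v, (lo, hi) in zip(t, q))]
        if len(hits) <= self.k:
            return hits, False       # resolved
        return hits[: self.k], True  # overflow


def binary_shrink(q: Rect, server: TopKServer) -> List[Point]:
    """Return all tuples covered by q, halving extents until queries resolve."""
    result, overflow = server.query(q)
    if not overflow:
        return result
    # An overflowing q cannot be a point (no point holds more than k tuples),
    # so some attribute is not exhausted; take the first such attribute.
    i = next(j for j, (lo, hi) in enumerate(q) if lo < hi)
    x1, x2 = q[i]
    x = (x1 + x2 + 1) // 2           # integer form of ceil((x1 + x2) / 2)
    q_left = q[:i] + ((x1, x - 1),) + q[i + 1:]
    q_right = q[:i] + ((x, x2),) + q[i + 1:]
    return binary_shrink(q_left, server) + binary_shrink(q_right, server)


data = [(10,), (20,), (30,), (35,), (45,), (55,), (55,), (55,)]
server = TopKServer(data, k=4)
print(sorted(binary_shrink(((0, 63),), server)), server.cost)
```

On this small made-up dataset the crawl finishes quickly, but the recursion depth is governed by the width of the extents rather than by n/k, which is the weakness noted in the remark above.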

2.2 One-Dimensional Case 

Before giving our ultimate algorithm for settling Problem 1 of 
any dimensionality d, in this subsection we first explain how it 
works for d = 1. This will clarify the rationale behind the algorithm's
efficiency, and facilitate our analysis for a general d. It is
worth mentioning that the presence of only one attribute removes 
the need to specify the split dimension in describing a split. 

Rank-shrink. Our algorithm, named rank-shrink, differs from
binary-shrink in two ways. First, when performing a (2-way) split,
instead of cutting the extent of a query q in half, we aim at ensuring
that at least k/4 tuples fall in each of the rectangles generated by
the split. Such a split, however, may not always be possible, which
as we will see can happen if many tuples are identical to each other.
Hence, the second difference that rank-shrink makes is to perform a
3-way split in such a scenario, which gives birth to a query (among
the 3 created) that can be immediately resolved.

Figure 3: Illustration of 1d rank-shrink. (a) Dataset D and queries. (b) Recursion tree.

Formally, given a query q, the algorithm eventually returns q(D). 
It starts by issuing q to the server, which returns a bag R of tuples. 
If q is resolved, the algorithm terminates by reporting R.
Otherwise (i.e., q overflows), we sort the tuples of R in ascending order,
breaking ties arbitrarily. Let o be the (k/2)-th tuple in the sorted
order, with its Ai-value being x. Now, we count the number c of 
tuples in R identical to o (i.e., R has c tuples with Ai-value x), and 
proceed as follows: 

• Case 1: c ≤ k/4. Split q at x into q_left and q_right, each of
which must contain at least k/4 tuples in R. To see this for
q_left, note that there are
at least k/2 - c ≥ k/4 tuples of R strictly smaller than x, all
of which fall in q_left. The case for q_right follows in analogy.

• Case 2: c > k/4. Perform a 3-way split on q at x. Let
q_left, q_mid and q_right be the resulting rectangles (note that
the ordering among them matters; see Section 2.1). Observe
that q_mid has degenerated into point x, and therefore, can
immediately be resolved.

As a technical remark, in Case 2, x might be the lower
(resp. upper) bound³ on the extent of q. If this happens, we
simply discard q_left (resp. q_right) as it would have a
meaningless extent.

In either case, we are left with at most two queries (i.e., q_left and
q_right) to further process. The algorithm handles each of them
recursively in the same manner.
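
The one-dimensional procedure can be sketched as follows. Several details are assumptions made for illustration: the server is a hypothetical TopKServer1D that returns the first k qualifying values in its stored order, the unbounded domain is modeled with floating-point infinities, and for values of k that are not multiples of 4 the rank k/2 is rounded down (but never below 1).

```python
from typing import List, Sequence, Tuple

INF = float("inf")


class TopKServer1D:
    """Hypothetical 1d top-k server; stored order plays the role of the ranking."""

    def __init__(self, values: Sequence[int], k: int) -> None:
        self.values, self.k, self.cost = list(values), k, 0

    def query(self, lo: float, hi: float) -> Tuple[List[int], bool]:
        self.cost += 1
        hits = [v for v in self.values if lo <= v <= hi]
        if len(hits) <= self.k:
            return hits, False       # resolved
        return hits[: self.k], True  # overflow


def rank_shrink_1d(server: TopKServer1D,
                   lo: float = -INF, hi: float = INF) -> List[int]:
    """Return every tuple whose value lies in [lo, hi]."""
    result, overflow = server.query(lo, hi)
    if not overflow:
        return result
    k = server.k
    ranked = sorted(result)
    x = ranked[max(k // 2, 1) - 1]   # value of the (k/2)-th tuple of R
    c = ranked.count(x)              # tuples of R identical to it
    out: List[int] = []
    if 4 * c <= k:                   # Case 1: 2-way split at x
        out += rank_shrink_1d(server, lo, x - 1)
        out += rank_shrink_1d(server, x, hi)
    else:                            # Case 2: 3-way split at x
        if x > lo:                   # otherwise q_left has a meaningless extent
            out += rank_shrink_1d(server, lo, x - 1)
        out += rank_shrink_1d(server, x, x)  # q_mid is a point, hence resolved
        if x < hi:                   # otherwise q_right has a meaningless extent
            out += rank_shrink_1d(server, x + 1, hi)
    return out


# Made-up values with three duplicates at 55; not the data of Figure 3.
values = [35, 55, 55, 55, 20, 10, 45, 30]
server = TopKServer1D(values, k=4)
print(sorted(rank_shrink_1d(server)), server.cost)
```

On this input the first query overflows and triggers a 3-way split at 55, the left part overflows once more and is 2-way split at 20, and all remaining queries are resolved, for 6 queries in total.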

Example. We use the dataset D in Figure 3a to demonstrate the
algorithm. Let k = 4. The first query is q1 = (-∞, ∞). Suppose
that the server responds by returning a bag R1 that consists of t6, t7, t8
and one tuple of a smaller value, together with a
signal that q1 overflows. The (k/2) = 2-nd smallest tuple in R1 is
t8 (after random tie breaking), whose value is x = 55. As R1 has
c = 3 tuples with value 55 and c > k/4 = 1, we perform a 3-way
split on q1 at 55, generating q2 = (-∞, 54], q3 = [55, 55] and
q4 = [56, ∞). As q3 has degenerated into a point, it is resolved
immediately, fetching t6, t7 and t8. These tuples have already been
extracted before, but this time they come with an extra fact that no
more tuple can exist at point 55.

Let us look at q2. Suppose that the server's response R2 consists of
t1, t2, t5 and one more tuple covered by q2, plus an overflow signal.
Hence, x = 20 and c = 1.
Thus, a two-way split on q2 at 20 creates q5 = (-∞, 19] and q6 =
[20, 54]. Queries q4, q5 and q6 are all resolved.

Analysis. The lemma below bounds the cost of rank-shrink. 

LEMMA 1. When d = 1, rank-shrink requires O(n/k) queries.



³ x cannot be both because otherwise q would be a point and
therefore could not have overflowed.

PROOF. The main tool used by our proof is a recursion tree T
that captures the spawning relationships of the queries performed
by rank-shrink. Specifically, each node of T represents a query.
Node u is the parent of node u' if query u' is created by a (2-way
or 3-way) split of query u. Each internal node thus has 2 or 3 child
nodes. Figure 3b shows the recursion tree for the queries performed
in our earlier example on Figure 3a.

We focus on bounding the number of leaves in T because it
dominates the number of internal nodes. Observe that each leaf v
corresponds to a disjoint interval in dom(A1), due to the way splits are
carried out. There are three types of v:

• Type-1: the query represented by v is immediately resolved
in a 3-way split (i.e., q_mid in Case 2). The interval of v
contains at least k/4 (identical) tuples in D.

• Type-2: query v is not type-1, but also covers at least k/4
tuples in D.

• Type-3: query v covers less than k/4 tuples in D.

For example, among the leaf nodes in Figure 3, q3 is of type 1, q5
and q6 are of type 2, and q4 is of type 3.

As the intervals of various leaves cover disjoint bags of tuples,
the number of type-1 and -2 leaves is at most n/(k/4) = 4n/k. Each
leaf of type-3 must have a sibling in T that is a type-1 leaf (i.e., in
Figure 3, such a sibling of q4 is q3). Indeed, a 2-way split leaves at
least k/4 tuples in each resulting rectangle, so a type-3 leaf can only
be created by a 3-way split, whose q_mid is a type-1 leaf. On the other
hand, a type-1 leaf has at most 2 siblings. It thus follows that there are at most
twice as many type-3 leaves as type-1, i.e., the number of type-3
leaves is no more than 8n/k. This completes the proof.

We remark that the above analysis implies that (quite loosely)
T has no more than 4n/k + 8n/k = 12n/k leaves. Thus, there
cannot be more than this number of internal nodes in T. □

2.3 Rank-Shrink for Higher Dimensionality 

We are now ready to extend rank-shrink to handle any d > 1. In
addition to the ideas exhibited in the preceding subsection, we also
apply an inductive approach: converting the d-dimensional problem
to several (d - 1)-dimensional ones. Our discussion below
assumes that the (d - 1)-dimensional problem has already been
settled by rank-shrink.

Given a query q, the algorithm (as in 1d) sets out to solicit the
server's response R, and finishes if q is resolved. Otherwise, it
examines whether A1 is exhausted in q, i.e., whether the extent
of q on A1 has only 1 value, say x, in dom(A1). If so, we can
then focus on attributes A2, ..., Ad. This is a (d - 1)-dimensional
version of Problem 1, in the (d - 1)-dimensional subspace covered
by the extents of q on A2, ..., Ad, eliminating A1 by fixing it to x.
Hence, we invoke rank-shrink to solve it.

Consider that A1 is not exhausted on q. Similar to the 1d
algorithm, we will split q such that, either every resulting rectangle
covers at least k/4 tuples in R, or one of them can be immediately
solved as a (d - 1)-dimensional problem. The splitting proceeds
exactly as described in Cases 1 and 2 of Section 2.2. The only
difference is that the rectangle q_mid in Case 2 is not a point, but instead,
a rectangle on which A1 has been exhausted. Hence, q_mid is
processed as a (d - 1)-dimensional problem with rank-shrink.

As with the 1d case, the algorithm recurses on q_left and q_right
(provided that they have not been discarded for having a
meaningless extent on A1).
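
The d-dimensional procedure admits a compact recursive sketch. The code below is an illustration under stated assumptions rather than a definitive implementation: the server is a hypothetical TopKServer returning the first k qualifying tuples in stored order, unbounded extents are modeled with floating-point infinities, attributes are processed in index order (index 0 plays the role of A1), and the rank k/2 is rounded down (but never below 1) when k is not a multiple of 4.

```python
from typing import List, Sequence, Tuple

INF = float("inf")
Point = Tuple[int, ...]
Rect = Tuple[Tuple[float, float], ...]   # closed extent [lo, hi] per attribute


class TopKServer:
    """Hypothetical top-k server; stored order plays the role of the ranking."""

    def __init__(self, data: Sequence[Point], k: int) -> None:
        self.data, self.k, self.cost = list(data), k, 0

    def query(self, q: Rect) -> Tuple[List[Point], bool]:
        self.cost += 1
        hits = [t for t in self.data
                if all(lo <= v <= hi for v, (lo, hi) in zip(t, q))]
        if len(hits) <= self.k:
            return hits, False       # resolved
        return hits[: self.k], True  # overflow


def rank_shrink(server: TopKServer, q: Rect, dim: int = 0) -> List[Point]:
    """Return q(D); every attribute with index < dim is exhausted on q."""
    result, overflow = server.query(q)
    if not overflow:
        return result
    # Skip exhausted attributes: on them the problem is lower-dimensional.
    # An overflowing q cannot be a point, so a non-exhausted attribute exists.
    while q[dim][0] == q[dim][1]:
        dim += 1
    lo, hi = q[dim]
    k = server.k
    ranked = sorted(t[dim] for t in result)
    x = ranked[max(k // 2, 1) - 1]   # value of the (k/2)-th tuple on this attribute
    c = ranked.count(x)

    def sub(extent: Tuple[float, float]) -> Rect:
        return q[:dim] + (extent,) + q[dim + 1:]

    out: List[Point] = []
    if 4 * c <= k:                   # Case 1: 2-way split
        out += rank_shrink(server, sub((lo, x - 1)), dim)
        out += rank_shrink(server, sub((x, hi)), dim)
    else:                            # Case 2: 3-way split
        if x > lo:
            out += rank_shrink(server, sub((lo, x - 1)), dim)
        out += rank_shrink(server, sub((x, x)), dim + 1)  # q_mid: one dimension less
        if x < hi:
            out += rank_shrink(server, sub((x + 1, hi)), dim)
    return out


# Made-up 2d data with five tuples on the line A1 = 80; not the data of Figure 4.
data = [(80, 1), (80, 2), (80, 3), (80, 4), (80, 5),
        (10, 1), (20, 2), (40, 3), (60, 4), (90, 5)]
server = TopKServer(data, k=4)
whole_space = ((-INF, INF), (-INF, INF))
print(len(rank_shrink(server, whole_space)), server.cost)
```

In this run the first query is 3-way split at A1 = 80, and the middle rectangle is then crawled as a one-dimensional problem on the second attribute, for 6 queries in total.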

Example. We demonstrate the algorithm using the 2d dataset in
Figure 4, where D has 10 tuples t1, ..., t10. Let k = 4. The
first query q1 issued covers the entire data space. Suppose that
the server responds with a bag R1 of k = 4 tuples and an overflow
signal. We 3-way split q1 at A1 = 80 into q2, q3 and q4, whose
rectangles can be found in Figure 4. Specifically, the A1-extents of
q2, q3, q4 are (-∞, 79], [80, 80], [81, ∞) respectively, while their
A2-extents are all (-∞, ∞). Note that A1 is exhausted on q3;
alternatively, we can see that q3 is equivalent to a 1d query on the
vertical line A1 = 80. Hence, q3 is recursively settled by our 1d
algorithm (which, as can be verified easily, requires 3 queries).

Suppose that the server's response to q2 is a bag R2 of 4 tuples
and an overflow signal. Accordingly, q2 is split into q5 and q6 at
A1 = 40, whose rectangles are also shown in Figure 4. Finally,
q4, q5 and q6 are all resolved.

Figure 4: Illustration of 2d rank-shrink

Analysis. We have the lemma below for general d: 

LEMMA 2. Rank-shrink performs O(dn/k) queries.

PROOF. The case d = 1 has been proved in Lemma 1. Next,
assuming that rank-shrink issues at most α(d - 1)n/k queries for
solving a (d - 1)-dimensional problem with n tuples (where α is a
positive constant), we will show that the cost is at most αdn/k for
dimensionality d.

Again, our argument leverages a recursion tree T. As before,
each node of T is a query, such that node u parents node u', if
query u' was created from splitting u. We make a query v a leaf of
T as soon as one of the following occurs:

• v is resolved. We associate v with a weight set to 1.

• A1 is exhausted on rectangle v. Recall that such a query
is solved as a (d - 1)-dimensional problem. We associate
v with a weight, equal to the cost for rank-shrink for that
problem.

For our earlier example with Figure 4, the recursion tree T happens
to be the same as the one in Figure 3b. The difference is that each
leaf has a weight. Specifically, the weight of q3 is 3 (i.e., the cost
of solving the 1d query at the vertical line A1 = 80 in Figure 4),
and the weights of the other leaves are 1.

The total cost of rank-shrink on the d-dimensional problem,
therefore, equals the total number of internal nodes in T, plus the
total weight of all the leaves.

As the A1-extents of the leaves' rectangles have no overlap, their
rectangles cover disjoint tuples. Let us classify the leaves into
types-1, -2 and -3 as in the proof of Lemma 1, by adapting the
definition of type-1 in a straightforward fashion: v is of this type if
it is the middle node q_mid from a 3-way split. Each type-3 leaf has
weight 1 (as its corresponding query must be resolved). As proved
in Lemma 1, the number of them is no more than 8n/k.

Let v1, ..., vβ be all the type-1 and -2 nodes (i.e., suppose the
number of them is β). Assume that node vi contains ni tuples of
D. It holds that $\sum_{i=1}^{\beta} n_i \le |D| = n$. The weight of vi, by our
inductive assumption, is at most α(d - 1)ni/k. Hence, the total
weight of all the type-1 and -2 nodes does not exceed α(d - 1)n/k.

The same argument in the proof of Lemma 1 shows that T has
less than 12n/k internal nodes. Thus, summarizing the above
analysis, the cost of d-dimensional rank-shrink is no more than:

$$\frac{12n}{k} + \frac{8n}{k} + \alpha(d-1)\frac{n}{k} = \big(20 + \alpha(d-1)\big)\frac{n}{k}.$$

To complete our inductive proof, we want (20 + α(d - 1))n/k to be
bounded from above by αdn/k. This is true for any α ≥ 20. □

Remark. This concludes the proof of the first bullet of Theorem 1.
We point out that when d is a fixed value (as is true in practice), the
time complexity in Lemma 2 becomes O(n/k), that is, asymptotically
matching the trivial lower bound n/k. A natural question
at this point is, if d is not constant, is there an algorithm that can
still guarantee cost O(n/k)? In Section 4, we will show that this is
impossible.

3. CATEGORICAL ATTRIBUTES 

We proceed to solve Problem 1 when the data space D is categorical.
Recall that, as mentioned in Section 1.1, the domain
dom(Ai) of the i-th attribute Ai is the set of integers in [1, Ui]
(where Ui = |dom(Ai)|), although it should be understood that
the ordering of those integers is irrelevant. We will again first (in
Section 3.1) clarify some preliminary concepts and give a baseline
algorithm, before presenting the proposed solution (in Section 3.2).

3.1 Data Space Tree and Depth First Search 

Unlike range predicates on numeric attributes, the predicate supported
by the server on a categorical attribute Ai (1 ≤ i ≤ d) is an
equality constraint of the form Ai = x, where x is either a value in
dom(Ai) or a wildcard *. This difference prompts us to adopt an
alternative approach to attack Problem 1. Instead of performing 2-
or 3-way splits (as in the numeric case), we enumerate the
points in D. This idea looks drastic at first glance: D has a total of
$\prod_{i=1}^{d} U_i$ points, where Ui is the domain size of Ai. A
naive way to enumerate the entire D is clearly intractable. It turns
out, interestingly, that we can significantly reduce the cost from the
formidable $\prod_{i=1}^{d} U_i$ to only roughly a linear term
$\sum_{i=1}^{d} U_i$.

Data space tree. Let us start by arranging all the points of D into
a tree T, which we refer to as the data space tree. Each node u of
T represents a subspace enclosing all the points of D satisfying a
condition like:

$$A_1 = c_1, \ldots, A_\ell = c_\ell, A_{\ell+1} = *, \ldots, A_d = *$$

where ℓ is an integer in [0, d] and c1, ..., cℓ are not wildcards. We
say that u is at level ℓ. We will refer to its condition as
query(u) because the condition obviously corresponds to a query
that can be submitted to the server.

If we look at a point (c1, ..., cd) in D as a string concatenating all
its coordinates from the first to the d-th attribute, T can be regarded
as a trie on all the $\prod_{i=1}^{d} U_i$ strings. Formally, the root
of T is at level 0, and thus represents the entire D (equivalently, the
condition of the root specifies a wildcard on all dimensions). In
general, if u is a level-ℓ node (0 ≤ ℓ ≤ d - 1), it has $U_{\ell+1}$
child nodes of level ℓ + 1. The i-th (1 ≤ i ≤ $U_{\ell+1}$) child v of
u is such that query(v) agrees with query(u) on all dimensions, except
$A_{\ell+1}$ on which query(v) specifies $A_{\ell+1} = i$. That is, v
refines u on $A_{\ell+1}$ by setting this attribute to i. Each leaf of
T is at level d and represents a distinct point in D.
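
The data space tree never has to be materialized: a node can be represented by the prefix of constants it fixes. The following Python sketch illustrates this under assumed names (`Node`, `query_of`, `children`, and the use of `None` for the wildcard are illustrative choices, not notation from the paper).

```python
from typing import List, Optional, Sequence, Tuple

Node = Tuple[int, ...]   # hypothetical encoding: a level-l node is (c1, ..., cl)
WILDCARD = None          # assumption: None plays the role of the wildcard *


def query_of(node: Node, d: int) -> List[Optional[int]]:
    """query(u): constants on the first len(node) attributes, wildcards on the rest."""
    return list(node) + [WILDCARD] * (d - len(node))


def children(node: Node, domain_sizes: Sequence[int]) -> List[Node]:
    """Children of a level-l node set attribute A_{l+1} to each value in [1, U_{l+1}]."""
    level = len(node)
    if level == len(domain_sizes):   # a leaf sits at level d and has no children
        return []
    return [node + (value,) for value in range(1, domain_sizes[level] + 1)]


root: Node = ()
print(query_of(root, 2))          # the root has a wildcard on every dimension
print(children((1,), [4, 4]))     # the four level-2 nodes under A1 = 1
```

With this encoding, the trie view is immediate: a node is a string prefix, and its children extend the prefix by one coordinate.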

To illustrate, Figure 5a shows a dataset D with 10 tuples
t1, ..., t10 in a 2d space D where each dimension has domain size
4. Figure 5b demonstrates the data space tree T (the subtrees of
two of the nodes are omitted for simplicity). Node u1, for instance,
is associated with a query(u1) that has predicates A1 = * and
A2 = *, whereas query(u2) has predicates A1 = 1 and A2 = *,
and query(u7) has A1 = 1 and A2 = 2.

Figure 5: Illustration of categorical algorithms

The following lemma presents a fact that will be useful later. 

LEMMA 3. Let u, v be two distinct nodes at the same level of T. No
tuple of D can satisfy query(u) and query(v) at the same time.

PROOF. Let the level be ℓ. The conditions in query(u) and
query(v) differ in their predicates on at least one attribute Ai for
some i ∈ [1, ℓ]. No tuple can satisfy those predicates simultaneously. □

Depth first search (DFS). We now describe an algorithm, named 
DFS, that serves as the baseline approach. The algorithm simply 
traverses T in a depth-first manner. For each node u in T, it sends 
query(u) to the server, and acquires the bag R of tuples returned. 
As a pruning rule, if query(u) is resolved, the subtree of u no 
longer needs to be explored, as all the tuples in the subtree are 
already in R. If, on the other hand, query(u) overflows, the algo- 
rithm processes each child of u in the same manner. 

Suppose k = 3. On the input of Figure 5a, DFS examines the
nodes in the order u1, u2, u6, u7, ... To see an example of pruning,
consider the moment when DFS is at u3. Since query(u3) is resolved
(the query has predicates A1 = 2 and A2 = *, and returns
only a single tuple), the subtree of u3 can thus be eliminated. It can
be verified that DFS eventually visits all of u1, ..., u13.
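
The baseline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `make_server` is a hypothetical stand-in for the search interface (it returns at most k tuples and an overflow flag), `None` encodes the wildcard, and the toy data below is made up for the example rather than taken from Figure 5a.

```python
def make_server(dataset, k):
    """Hypothetical stand-in for the top-k interface: returns (tuples, overflow)."""
    stats = {"queries": 0}

    def ask(query):
        stats["queries"] += 1
        hits = [t for t in dataset
                if all(c is None or c == v for c, v in zip(query, t))]
        return hits[:k], len(hits) > k   # at most k tuples, plus an overflow flag

    return ask, stats


def dfs_crawl(ask, domain_sizes):
    """Depth-first traversal of the data space tree with the resolved-query pruning rule."""
    d = len(domain_sizes)
    found = []

    def visit(node):
        result, overflow = ask(list(node) + [None] * (d - len(node)))
        if not overflow:              # resolved: the whole subtree is already in result
            found.extend(result)
            return
        if len(node) == d:
            raise ValueError("more than k tuples at one point; cannot be crawled")
        for value in range(1, domain_sizes[len(node)] + 1):
            visit(node + (value,))    # overflow: process every child

    visit(())
    return found


# Illustrative data only (not the dataset of Figure 5a).
DATA = [(1, 1), (1, 2), (1, 2), (1, 3), (2, 4),
        (3, 1), (3, 1), (3, 2), (3, 3), (4, 4)]
ask, stats = make_server(DATA, k=3)
crawled = dfs_crawl(ask, [4, 4])
print(sorted(crawled) == sorted(DATA), stats["queries"])
```

On this toy input two of the four level-1 queries overflow, so the traversal issues one query for the root, four at level 1, and eight at level 2.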

Remark. Not surprisingly, DFS incurs expensive query cost in the 
worst case. We omit its analysis because it is tedious, and yet this 
algorithm is not the one advocated in this paper. In the next sub- 
section, we give a better algorithm with the optimal performance. 

3.2 Algorithm Slice-Cover 

Slice query. We say that a query q is a slice query if its predicates
have the form:

$$A_1 = *, \ldots, A_{i-1} = *, A_i = c, A_{i+1} = *, \ldots, A_d = *$$

where c is a value in dom(Ai). Namely, the query has a wildcard
predicate on all but one attribute Ai for some i ∈ [1, d]. We use the
notation Ai = c to uniquely refer to a slice query. Clearly, varying
c in dom(Ai) defines Ui slice queries, such that the total number
of slice queries of all dimensions is $\sum_{i=1}^{d} U_i$. In the
example of Figure 5a, there are 8 slice queries in total.
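
As a small illustration, the slice queries of a space can be enumerated directly from the domain sizes. The function name `slice_queries` and the encoding of a query as a tuple with `None` for the wildcard are assumptions made for this sketch.

```python
def slice_queries(domain_sizes):
    """Yield every slice query Ai = c as a tuple with None standing for the wildcard."""
    d = len(domain_sizes)
    for i, size in enumerate(domain_sizes):
        for c in range(1, size + 1):
            query = [None] * d
            query[i] = c
            yield tuple(query)


all_slices = list(slice_queries([4, 4]))
print(len(all_slices))   # one query per (attribute, value) pair
```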

Slice-cover. Now we describe an algorithm, named slice-cover,
for solving Problem 1. The algorithm runs in two phases. In the
preprocessing phase, we simply submit every slice query to the
server, and record its response locally in a lookup table as follows.
If a slice query q is resolved, its result q(D) (which has at most k
tuples) is entered in the table, whereas if q overflows, we remember
nothing but a bit indicating that |q(D)| > k. Figure 6 presents the
contents of the lookup table for the example in Figure 5, assuming
k = 3.

Figure 6: Lookup table of slice queries (k = 3)

The second phase of slice-cover executes an algorithm, named
extended-DFS, by supplying the root of the data space tree T as the
parameter. In general, given a node u in T, extended-DFS returns
all the tuples in D satisfying query(u) (hence, setting u to the
root of T settles Problem 1). Extended-DFS also performs a depth-
first traversal of T (similar to the DFS algorithm of Section 3.1),
but leverages the contents of the lookup table to boost the pruning
effectiveness.

Without loss of generality, suppose that u is at level ℓ (for some
ℓ ∈ [0, d]). Extended-DFS starts by sending query(u) to the
server. If query(u) is resolved, the algorithm ends by reporting
the result received from the server, because the subtree of u can be
eliminated (as explained in Section 3.1 for DFS). Next, we focus
on the more interesting case where query(u) overflows, implying
that u is an internal node of T.

As with DFS, extended-DFS may access each child node v of
u. However, before doing so, it attempts to answer query(v)
locally using the lookup table (i.e., without bothering the server
with another query). To explain, recall that v, which is at level
ℓ + 1, is associated with a query(v) that refines query(u). That
is, query(v) inherits the predicates of query(u) on all attributes,
except $A_{\ell+1}$ on which query(v) has a predicate, say,
$A_{\ell+1} = c$ for some c ∈ dom($A_{\ell+1}$).

Let us observe that the result of query(v) is completely contained
in the result of the slice query $A_{\ell+1} = c$. Remember that
the server's response, say R, to the slice query is already available
in our lookup table. Therefore, we fetch R from the table
(at no cost), and see whether the slice query was resolved. If yes,
query(v) can be accurately answered by returning the tuples in R
satisfying query(v), in which case the subtree of v does not need
to be explored further. If, on the other hand, R shows that the slice
query overflowed, extended-DFS recursively processes v in the same
manner.
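
The two phases can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: `make_server` is a hypothetical top-k interface, `None` encodes the wildcard, attribute indices are zero-based, and the toy data is invented. The sketch always sends query(u) when extended-DFS is invoked on u, as in the general description above.

```python
def make_server(dataset, k):
    """Hypothetical stand-in for the top-k interface: returns (tuples, overflow)."""
    stats = {"queries": 0}

    def ask(query):
        stats["queries"] += 1
        hits = [t for t in dataset
                if all(c is None or c == v for c, v in zip(query, t))]
        return hits[:k], len(hits) > k

    return ask, stats


def slice_cover(ask, domain_sizes):
    d = len(domain_sizes)

    # Phase 1 (preprocessing): table[(i, c)] holds the result of slice query
    # A_{i+1} = c if it was resolved, or None if it overflowed.
    table = {}
    for i, size in enumerate(domain_sizes):
        for c in range(1, size + 1):
            query = [None] * d
            query[i] = c
            result, overflow = ask(query)
            table[(i, c)] = None if overflow else result

    def matches(t, node):
        return all(t[j] == node[j] for j in range(len(node)))

    # Phase 2 (extended-DFS): consult the table before recursing into a child.
    def extended_dfs(node):
        result, overflow = ask(list(node) + [None] * (d - len(node)))
        if not overflow:
            return list(result)
        level = len(node)
        if level == d:
            raise ValueError("more than k tuples at one point; cannot be crawled")
        out = []
        for c in range(1, domain_sizes[level] + 1):
            child = node + (c,)
            cached = table[(level, c)]
            if cached is not None:    # slice query resolved: answer query(child) locally
                out.extend(t for t in cached if matches(t, child))
            else:                     # slice query overflowed: recurse
                out.extend(extended_dfs(child))
        return out

    return extended_dfs(())


# Illustrative data only (not the dataset of Figure 5a).
DATA = [(1, 1), (1, 2), (1, 2), (1, 3), (2, 4),
        (3, 1), (3, 1), (3, 2), (3, 3), (4, 4)]
ask, stats = make_server(DATA, k=3)
crawled = slice_cover(ask, [4, 4])
print(sorted(crawled) == sorted(DATA), stats["queries"])
```

On this toy input the preprocessing phase issues 8 slice queries, and extended-DFS is invoked only on the root and on the two level-1 nodes whose slice queries overflowed.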

Example. We illustrate extended-DFS using the example of Fig- 
ure 5. Let k = 3. The lookup table output by the preprocessing 
stage is in Figure 6. 

The algorithm starts with query(u1), where u1 is the root of T
(Figure 5b). Even without sending query(u1) to the server, we
know that it overflows for sure, because at least one slice query
overflowed in preprocessing, as is clear from the (lookup) table.
Focusing on the first child u2 of u1, extended-DFS inspects the table
to decide whether query(u2) can be answered locally. The inspection
examines slice query A1 = 1, i.e., the predicate by which
query(u2) refines query(u1). The table indicates that the slice
query overflowed. Hence, we recursively apply extended-DFS on u2.

At u2, the algorithm checks the table as to whether query(u6)
can be answered locally. This time, we focus on slice query A2 = 1
(i.e., the extra predicate in query(u6) compared to query(u2)),
which turns out to be resolved. Hence, query(u6) is directly answered
from the two-tuple result of the slice query (i.e., returning
only t1, as the other tuple does not qualify query(u6)). Thus,
extended-DFS does not recurse into u6. Similarly, query(u7),
query(u8) and query(u9) can all be answered locally.

We now backtrack to the root u1, and turn attention to the
second child u3 of u1. A lookup is carried out to see whether
query(u3) can be acquired from the table. The answer is yes (using
the result of slice query A1 = 2). Hence, the subtree of u3 is
not explored further. The rest of the execution proceeds in the same
manner. Overall, besides u1, extended-DFS is also (recursively)
invoked on u2 and u4. No query is ever issued to the server in the
entire process.

Heuristic. Next, we give a heuristic that does not affect the worst-
case cost of slice-cover, but can improve its performance on real
data. The motivating rationale is that some slice queries executed
in the preprocessing phase may not eventually be needed. In any
case, even if a slice query does need to be consulted by the algorithm,
there is no harm in running the query the first time such a need
arises, and registering the server's response in the lookup table. If the
slice query is consulted a second time, (as before) the query
does not need to be re-issued, for its result is already available in
the table. This allows us to get rid of the entire preprocessing phase.
We refer to the algorithm equipped with this heuristic as
lazy-slice-cover.
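
The heuristic amounts to memoizing slice queries. A minimal sketch, assuming a callable `ask(query)` that returns `(tuples, overflow)`; the class name `LazySliceTable`, the zero-based attribute index, and the toy `fake_ask` server are all illustrative.

```python
class LazySliceTable:
    """Lookup table that issues a slice query only the first time it is consulted."""

    def __init__(self, ask, d):
        self.ask = ask      # callable: query -> (tuples, overflow); assumed interface
        self.d = d
        self.cache = {}

    def lookup(self, i, c):
        """Return the result of slice query A_{i+1} = c, or None if it overflowed."""
        key = (i, c)
        if key not in self.cache:
            query = [None] * self.d
            query[i] = c
            result, overflow = self.ask(query)
            self.cache[key] = None if overflow else result
        return self.cache[key]


issued = []


def fake_ask(query):
    """Toy server used only for this demonstration: every query is resolved and empty."""
    issued.append(tuple(query))
    return [], False


table = LazySliceTable(fake_ask, d=2)
table.lookup(0, 3)
table.lookup(0, 3)      # second consultation is served from the cache
print(len(issued))
```

Because a slice query is sent at most once and only when needed, the number of queries can never exceed that of the eager preprocessing phase.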

Analysis. We bound the performance of slice-cover with the following
result, which also applies to lazy-slice-cover as it does not
require any more queries than slice-cover.

LEMMA 4. If d > 1, slice-cover performs at most
$\sum_{i=1}^{d} U_i + \frac{n}{k} \sum_{i=1}^{d} \min\{U_i, \frac{n}{k}\}$
queries. If d = 1, the number of queries is U1.

PROOF. For d = 1, slice-cover terminates right after the
preprocessing phase, and hence, issues U1 queries. For d ≥ 2, the
preprocessing phase obviously issues $\sum_{i=1}^{d} U_i$ queries.
Next, we will show that extended-DFS incurs cost at most
$\frac{n}{k} \sum_{i=1}^{d} \min\{U_i, \frac{n}{k}\}$.

Let T' be the set of nodes of T on which extended-DFS is invoked
(i.e., T' includes the root of T, and the nodes extended-DFS recurses
into). For instance, on the example of Figure 5, our earlier discussion
showed that T' includes nodes u1, u2 and u4. The number of
nodes in T' is an upper bound of the number of queries
extended-DFS performs.

For each i ∈ [1, d], let Si be the set of level-i nodes in T'. The
following analysis will prove that |Si| ≤ (n/k) · min{Ui, n/k}, which
is enough to complete the proof. To bound |Si|, we consider
$S_{i-1}$ (i.e., one level closer to the root). We will show:

• Fact 1: at most n/k nodes in $S_{i-1}$ are internal in T'.

• Fact 2: each internal node in $S_{i-1}$ can have at most
min{Ui, n/k} child nodes in T'.

Combining both facts gives the desired upper bound
(n/k) · min{Ui, n/k} on |Si|.

Proof of Fact 1. For a node u ∈ $S_{i-1}$, denote by $n_u$ the number
of tuples in D satisfying query(u). It follows from Lemma 3 that

$$\sum_{u \in S_{i-1}} n_u \le |D|.$$

On the other hand, as u is an internal node in T', $n_u$ > k. Otherwise,
query(u) would have been resolved, in which case the subtree
of u should have been pruned in extended-DFS, contradicting
u being an internal node. This, together with
$\sum_{u \in S_{i-1}} n_u \le n$, proves that $S_{i-1}$ has no more
than n/k internal nodes.

Proof of Fact 2. For each node u ∈ $S_{i-1}$, let children(u)
be the set of child nodes of u in T. By the way T is defined,
|children(u)| = Ui. As u cannot have more child nodes in T'
than in T, Fact 2 holds if Ui ≤ n/k. The rest of the proof assumes
Ui > n/k.

For each node v ∈ children(u), query(v) refines query(u) by
replacing the predicate on Ai (which is Ai = * in query(u)) with
Ai = c for some c ∈ dom(Ai). Furthermore, each v specifies a
different c. Extended-DFS guarantees that v should not be accessed
(and hence, should not belong to T') if the slice query Ai = c is
resolved. In other words, if we denote by Q the set of slice queries
of the form Ai = c (i.e., |Q| = Ui), the number of child nodes of
u in T' is at most the number of queries in Q that overflow.

Clearly, no tuple of D can satisfy both slice queries Ai = c1 and
Ai = c2 as long as c1 ≠ c2. In other words, the sum of |q(D)|
(i.e., the number of tuples satisfying q) of all q ∈ Q is at most n.
On the other hand, |q(D)| > k if q overflows. Thus, the number of
overflowing queries in Q is at most n/k. □
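
For reference, the counting behind Lemma 4 (for d > 1) can be collected into one display. This only restates the two facts and the preprocessing cost from the proof; it adds no new claim.

```latex
\begin{align*}
|S_i| &\le \frac{n}{k} \cdot \min\Big\{U_i, \frac{n}{k}\Big\}
      && \text{(Fact 1 combined with Fact 2)} \\
\sum_{i=1}^{d} |S_i| &\le \frac{n}{k} \sum_{i=1}^{d} \min\Big\{U_i, \frac{n}{k}\Big\}
      && \text{(summing over all levels)} \\
\text{bound of Lemma 4} &= \sum_{i=1}^{d} U_i
      + \frac{n}{k} \sum_{i=1}^{d} \min\Big\{U_i, \frac{n}{k}\Big\}
      && \text{(preprocessing plus extended-DFS)}
\end{align*}
```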

Remark. Lemma 4 establishes the second and third bullets of
Theorem 1. As we will see in the next section, the upper bounds in the
lemma cannot be improved by more than a constant factor in the
worst case.

4. LOWER BOUND RESULTS 

This section turns away from upper bounds, and focuses on the 
hardness of Problem 1. In Section 4.1 (4.2), we will give asymp- 
totic lower bounds on how many queries are needed for settling the 
problem, when the data space D is numeric (categorical). 

4.1 Numeric Attributes 

The objective of this subsection is to establish: 

THEOREM 3. Let k, d, m be arbitrary positive integers such
that d ≤ k. There is a dataset D (in a numeric data space) with
n = m(k + d) tuples such that any algorithm must use at least
dm queries to solve Problem 1 on D.

It is, therefore, impossible to improve our algorithm rank-shrink 
(see Lemma 2) by more than a constant factor in the worst case, as 
shown below: 

COROLLARY 1. In a numeric data space, no algorithm can 
guarantee solving Problem 1 with o(dn/k) queries. 

PROOF. If there existed such an algorithm, let us use it on the
inputs in Theorem 3. The cost is o(dn/k) = o(dm(k + d)/k)
which, due to d ≤ k, is o(dm), causing a contradiction. □

We now proceed to prove Theorem 3, using a hard dataset D as
illustrated in Figure 7. The domain of each attribute is the set of
integers from 1 to m + 1, namely, D = $[1, m+1]^d$. D has m
groups of d + k tuples. Specifically, the i-th (1 ≤ i ≤ m) group
has k tuples at the point (i, ..., i), taking value i on all attributes.
We call them diagonal tuples. Furthermore, for each j ∈ [1, d],
Group i also has a tuple that takes value i + 1 on attribute Aj, and i
on all other attributes. Such a tuple is referred to as a non-diagonal
tuple. Overall, D has km diagonal and dm non-diagonal tuples.
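
The construction is easy to reproduce. A short Python sketch (the function name `hard_numeric_dataset` is illustrative) that builds the dataset and checks the tuple counts stated above:

```python
def hard_numeric_dataset(k, d, m):
    """Build the m groups of k diagonal and d non-diagonal tuples described above."""
    data = []
    for i in range(1, m + 1):
        data.extend([(i,) * d] * k)      # k diagonal tuples at (i, ..., i)
        for j in range(d):               # one non-diagonal tuple per attribute Aj
            t = [i] * d
            t[j] = i + 1
            data.append(tuple(t))
    return data


dataset = hard_numeric_dataset(k=3, d=2, m=4)
diagonal = [t for t in dataset if len(set(t)) == 1]
print(len(dataset), len(diagonal), len(dataset) - len(diagonal))
```

The parameters respect d ≤ k, and for d ≥ 2 a tuple is diagonal exactly when all its coordinates coincide, which is what the filter relies on.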

Let S be the set of dm points in D that are equivalent to the 
dm non-diagonal tuples in D, respectively (i.e., each point in S 
corresponds to a distinct non-diagonal tuple). As explained in Sec- 
tion 2.1, each query can be regarded as an axis-parallel rectangle in 
D. With this correspondence in mind, we observe the following for 
any algorithm that correctly solves Problem 1 on D: 

LEMMA 5. When the algorithm terminates, each point in S
must be covered by a distinct resolved query already performed.

Figure 7: A hard numeric dataset

PROOF. Every point p ∈ S must be covered by a resolved query.
Otherwise, p is either never covered by any query, or covered by
only overflowing queries. In the former case, the tuple of D at p
could not have been retrieved, whereas in the latter, the algorithm
could not rule out the possibility that D had more than one tuple at
p. In neither case could the algorithm have terminated.

Next we show that no resolved query q covers more than one
point in S. Otherwise, assume that q contains p1 and p2 in S,
in which case q fully encloses the minimum bounding rectangle,
denoted as r, of p1 and p2. Without loss of generality, suppose that
p1 (p2) is from Group i (j) such that i ≤ j. If i = j, then r contains
the point (i, ..., i), in which case at least k + 2 tuples satisfy q (i.e.,
p1, p2 and the k diagonal tuples from Group i). Consider, on the
other hand, i < j. In this scenario, the coordinate of p1 is at most
i + 1 ≤ j on all attributes, while the coordinate of p2 is at least
j on all attributes. Thus, r contains the point (j, ..., j), causing
at least k + 2 tuples to satisfy q (i.e., p1, p2 and the k diagonal
tuples from Group j). Therefore, q must overflow in any case, i.e.,
a contradiction. □

The lemma indicates that at least |S| = dm queries must be 
performed, which validates the correctness of Theorem 3. 

4.2 Categorical Attributes 

First, if d = 1, U1 is a trivial lower bound on the cost of solving
Problem 1. The reason is that, as long as D has more than k tuples,
we must issue a query A1 = c for every c ∈ dom(A1) to verify
whether a tuple exists at point c. For d > 1, this subsection will
establish:

THEOREM 4. Let k, d, U be positive integers satisfying
$dU^2 \le 2^{d/4}$, U ≥ 3, k ≥ 3, and d = 2k. There is a dataset D with
n = dU tuples in a d-dimensional categorical space, where each
attribute has a domain size U, such that any algorithm must use
$\Omega(dU^2)$ queries to solve Problem 1 on D.

We thus know that our algorithm slice-cover (see Lemma 4) can- 
not be improved by more than a constant factor in the worst case, 
as shown below: 

COROLLARY 2. In a categorical data space, no algorithm
can guarantee solving Problem 1 with
$o\big(\sum_{i=1}^{d} U_i + \frac{n}{k} \sum_{i=1}^{d} \min\{U_i, \frac{n}{k}\}\big)$
queries.

Figure 8: A hard categorical dataset

PROOF. In the setting of Theorem 4, n/k = dU/k = 2U. Furthermore,
U1 = ... = Ud = U. Hence,
$\frac{n}{k} \sum_{i=1}^{d} \min\{U_i, \frac{n}{k}\} = \frac{n}{k} \cdot dU = 2dU^2$.
Thus, the complexity in the corollary is
$o(dU + dU^2) = o(dU^2)$. Hence, if the corollary was wrong,
there would be an algorithm solving Problem 1 on the inputs in
Theorem 4 with $o(dU^2)$ queries, which is a contradiction. □

The rest of the subsection serves as the proof of Theorem 4. Our
discussion is based on a hard dataset D illustrated in Figure 8.
Formally, D consists of U groups, each of which has d tuples. In the
i-th (0 ≤ i ≤ U - 1) group, for each attribute Aj (1 ≤ j ≤ d),
there is a tuple that takes value (i + 1) mod U on Aj, and value i
on all the other d - 1 attributes. In other words, the data space D is
$[0, U-1]^d$, although readers should be reminded that using integers
to represent the values of a (categorical) attribute is purely for
convenience, and that the ordering of those integers is irrelevant.
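
As with the numeric case, the hard categorical dataset can be generated directly from its definition. The sketch below uses an illustrative function name and small parameters chosen only to keep the output readable; it also counts the tuples that qualify one slice query.

```python
def hard_categorical_dataset(d, U):
    """Build the U groups of d tuples described above (values are in [0, U-1])."""
    data = []
    for i in range(U):                   # groups 0 .. U-1
        for j in range(d):               # one tuple per attribute Aj
            t = [i] * d
            t[j] = (i + 1) % U
            data.append(tuple(t))
    return data


dataset = hard_categorical_dataset(d=6, U=3)
on_slice = [t for t in dataset if t[0] == 1]   # tuples qualifying slice query A1 = 1
print(len(dataset), len(on_slice))
```

The slice query A1 = 1 is satisfied by d - 1 tuples of group 1 and by one tuple of group 0, i.e., by d tuples in total.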

As before, we say that a query covers a point p ∈ D if p satisfies
the query. The next lemma gives an important fact:

LEMMA 6. If an algorithm has solved Problem 1 on our
constructed D, then every point in D must be covered by at least one
resolved query already performed.

PROOF. We say that a point in D is empty if D has no tuple at
that point. Assume the existence of a point p ∈ D that is not covered
in any resolved query. Hence, p is either outside all the queries
the algorithm issued, or is covered only by the overflowing queries.
Hence, if p was empty, the algorithm got no hint as to whether a
tuple exists at p, and therefore, could not have terminated. On the
other hand, if p was not empty, the algorithm saw a tuple at p but
could not decide whether D had any other tuple at p (i.e.,
duplicates). In this case, the algorithm could not have terminated either.
Thus we have a contradiction. □

We will show that any correct algorithm must perform $\Omega(dU^2)$
resolved queries. Recall that, on each attribute Ai (1 ≤ i ≤ d),
a query q has either a wildcard predicate Ai = *, or a constant
predicate Ai = c for some c ∈ [0, U - 1]. We say that q is
diverse, if it has at least two non-wildcard predicates with different
constants specified. For example, the query with predicates
A1 = 1, A2 = 2, A3 = *, ..., Ad = * is diverse, and so is the query
with A1 = 1, A2 = 1, A3 = 2, A4 = *, ..., Ad = * (due to the
predicates on A2 and A3), whereas the query with A1 = 1, A2 = 1,
A3 = ... = Ad = * is not (as the same constant appears in the
non-wildcard predicates).

LEMMA 7. A diverse query q has at most two qualifying tuples, 
and hence, is always resolved (since k > 2). 



PROOF. Let c1 ≠ c2 be constants, each of which appears in a
non-wildcard predicate of q. Suppose c1 < c2. Two tuples from
different groups cannot satisfy q simultaneously. Otherwise, assume
that tuple t1 from group i and tuple t2 from group j qualify
q, and that i < j without loss of generality. As t1 contains only
i and i + 1 in its attributes, we know c1 = i and c2 = i + 1. If
j < U - 1, t2 contains only j and j + 1 in its attributes. We thus
require c1 = j = i, which violates i < j. If j = U - 1, t2 contains
only 0 and U - 1 in its attributes. In this case, c1 = 0 = i and
c2 = U - 1 = i + 1, which is also impossible because U > 2.

On the other hand, it is easy to see that no three tuples from the
same group can together take value c1 on one attribute, and also,
value c2 on another attribute. The lemma then follows. □

We say that a query q is monotonic, if (i) q has at least two non-
wildcard predicates, and (ii) the same constant is specified in all the
non-wildcard predicates. For example, the query with predicates
A1 = 1, A2 = 1, A3 = ... = Ad = * is monotonic, whereas
the query with A1 = 1, A2 = 2, A3 = *, ..., Ad = * is not (as
different constants are used in the non-wildcard predicates), and
neither is the query with A1 = 1, A2 = *, ..., Ad = * (as it has
only one non-wildcard predicate).

LEMMA 8. A resolved monotonic query q has at least d/2 non-
wildcard predicates, and hence, covers at most $2^{d/2}$ points in D.

PROOF. Let c be the constant in all the non-wildcard predicates
of q. If q has λ ≥ 2 non-wildcard predicates, it retrieves exactly
d - λ tuples from group c, and no tuple from any other group.
Hence, for q to be resolved, d - λ cannot exceed k, that is,
λ ≥ d - k = d/2. □

It turns out that if a query is resolved, it must be either diverse or 
monotonic. In fact, if a query q is neither diverse nor monotonic, 
it has at most one non-wildcard predicate. Such q must retrieve at 
least d tuples and hence, overflow (recall that d = 2k > k).
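
The three cases (diverse, monotonic, neither) can be checked by brute force on a small instance of the hard dataset. The sketch below is only a sanity check under illustrative names; it enumerates every query over a tiny space and does not enforce d = 2k, since the facts being checked concern only how many tuples qualify each kind of query.

```python
from itertools import product


def hard_categorical_dataset(d, U):
    """U groups; in group i, one tuple per attribute takes (i + 1) mod U there, i elsewhere."""
    data = []
    for i in range(U):
        for j in range(d):
            t = [i] * d
            t[j] = (i + 1) % U
            data.append(tuple(t))
    return data


def check_query_classes(d, U):
    """Enumerate all queries (None = wildcard) and check the tuple counts per class."""
    data = hard_categorical_dataset(d, U)
    for query in product([None, *range(U)], repeat=d):
        constants = [c for c in query if c is not None]
        hits = sum(all(c is None or c == v for c, v in zip(query, t)) for t in data)
        if len(set(constants)) >= 2:      # diverse (Lemma 7)
            assert hits <= 2
        elif len(constants) >= 2:         # monotonic (proof of Lemma 8)
            assert hits == d - len(constants)
        else:                             # neither: at most one non-wildcard predicate
            assert hits >= d
    return True


print(check_query_classes(d=4, U=3))
```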

Given two different integers x, y in [0, U - 1], we define a
bichromatic set S(x, y) of points in D:

S(x, y) includes all the points that take x or y as their
value on every attribute Ai (i ∈ [1, d]), except
points (x, ..., x) and (y, ..., y).

For example, for d = 3 and U = 3, S(1, 2) = {(1, 1, 2), (1, 2, 1),
(1, 2, 2), (2, 1, 1), (2, 1, 2), (2, 2, 1)}. That is, S(1, 2) has all the
points having only 1 or 2 as their coordinates, but does not contain
(1, 1, 1) and (2, 2, 2). Clearly, there are $\binom{U}{2}$ bichromatic
sets, each of which has $2^d - 2$ points.
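
A bichromatic set is straightforward to enumerate, which makes the size $2^d - 2$ easy to confirm on small inputs. The function name below is illustrative.

```python
from itertools import product


def bichromatic_set(x, y, d):
    """All d-dimensional points over {x, y} except (x, ..., x) and (y, ..., y)."""
    return [p for p in product((x, y), repeat=d) if len(set(p)) == 2]


points = bichromatic_set(1, 2, 3)
print(points)
print(len(bichromatic_set(0, 1, 5)))
```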

We are ready to explain why $\Omega(dU^2)$ queries are necessary to
settle Problem 1 on D. Our discussion considers only the situation
where fewer than $\frac{d}{8}\binom{U}{2}$ diverse queries are
performed by the algorithm (otherwise, trivially there are
$\frac{d}{8}\binom{U}{2} = \Omega(dU^2)$ queries). By
Lemma 6, every point of each bichromatic set must be covered by
some resolved query. If a query q covers at least one point in a
bichromatic set S(x, y), we say that q touches S(x, y).

A diverse query can touch at most one bichromatic set. As there
are fewer than $\frac{d}{8}\binom{U}{2}$ diverse queries but
$\binom{U}{2}$ bichromatic sets, we can find a bichromatic set that is
touched by fewer than d/8 diverse queries. Let S(α, β) be that
bichromatic set (for some α, β in [0, U - 1]), and Q the set of
diverse queries touching it (thus, |Q| < d/8).

Consider any query q ∈ Q. Since q touches S(α, β), q has two
non-wildcard predicates Ai = α and Aj = β for some i, j ∈ [1, d]
with i ≠ j (in case multiple pairs of (i, j) satisfy this requirement,
choose one arbitrarily). Refer to Ai and Aj as the salient attributes
of q. It is clear that q cannot cover those points p ∈ S(α, β) such
that p has value α on both salient attributes of q. Let salient(Q)
be the union of the salient attributes of all the queries in Q. Hence,
|salient(Q)| < d/4. By the previous reasoning, if a point p ∈
S(α, β) takes value α on all the attributes in salient(Q), p is not
covered by any of the queries in Q. How many such p are there?
As |salient(Q)| < d/4, p can still choose α or β as its value on
each of the at least 1 + 3d/4 attributes outside salient(Q). Excluding
points (α, ..., α) and (β, ..., β), we know that the number of such p
is at least $2^{1+3d/4} - 2 > 2^{3d/4}$. As a remark, although we
required p to take α on the attributes in salient(Q), the argument
works as well by taking β.

We have shown that at least $2^{3d/4}$ points in S(α, β) have not
been covered by diverse queries. Those points must be covered by
resolved monotonic queries, each of which, by Lemma 8, contains
at most $2^{d/2}$ points in D. Therefore, the number of such queries
must be at least $2^{3d/4} / 2^{d/2} = 2^{d/4}$ which, as stated in
Theorem 4, is at least $dU^2$, thus completing the proof.

5. EXTENSIONS: MIXED ATTRIBUTES 

This section will extend our techniques to handle a mixed data
space D that has both numeric and categorical attributes. As defined
in Section 1.1, we consider without loss of generality that the first
cat attributes A1, ..., $A_{cat}$ are categorical, and the other d - cat
attributes $A_{cat+1}$, ..., Ad are numeric.

Define D_CAT as the categorical data (sub)space dom(A1) ×
... × dom($A_{cat}$), namely, involving all and only the categorical
attributes. Put differently, for any point p ∈ D, trimming off its
coordinates on $A_{cat+1}$, ..., Ad gives a cat-dimensional point p_CAT
in D_CAT. In a natural manner, p_CAT determines a set of points, which
we denote by D_NUM(p_CAT), defined as:

D_NUM(p_CAT) is the set of points p' ∈ D such that p'
shares the same value as p_CAT on every categorical
attribute.

In fact, D_NUM(p_CAT) decides a (d - cat)-dimensional numeric
subspace of D: it includes all the numeric dimensions of D, while
fixing the categorical attributes to those of p_CAT.

As an example, the scenario of Figure 1 has four attributes,
among which cat = 2 are categorical: A1 = MAKE and A2 =
BODY STYLE. Let p_CAT = (BMW, sedan). Then, D_NUM(p_CAT)
includes all the points p' ∈ D whose MAKE and BODY STYLE are
BMW and sedan, respectively (but no constraint is imposed on
the values of p' along the numeric attributes A3 = PRICE and
A4 = MILEAGE).

Hybrid. Next, we present an algorithm, named hybrid, to solve
Problem 1 in a mixed D. The algorithm combines lazy-slice-cover
(see Section 3.2) and rank-shrink (Section 2.3). Roughly speaking,
it first enumerates all the points in D_CAT using lazy-slice-cover and,
when a point p_CAT ∈ D_CAT has been reached, invokes rank-shrink in
the numeric subspace determined by D_NUM(p_CAT).

Now we provide the missing details. Recall that lazy-slice-cover
runs on a categorical server (i.e., one that supports only categorical
attributes). To apply it on a server of our context here, we set a
query's predicate on numeric attribute Ai (for each i ∈ [cat +
1, d]) to Ai ∈ (-∞, ∞). The effect is to disregard all the numeric
attributes, and hence, to essentially emulate a categorical server.

Also recall that lazy-slice-cover performs an (improved) depth-
first traversal on the data space tree T_CAT built from D_CAT. Consider
that it has come to a leaf node of T_CAT, or equivalently, a point
p_CAT ∈ D_CAT. In Section 3.2, the processing of such a point finished
with one extra query, but hybrid invokes rank-shrink upon
D_NUM(p_CAT). Remember, however, that rank-shrink operates on a
numeric server (i.e., one that supports only numeric attributes). To
apply it on D_NUM(p_CAT), we fix a query's predicate on categorical
attribute Ai (for each i ∈ [1, cat]) to Ai = ci, where ci is the Ai
value of p_CAT. This effectively emulates a numeric server over the
(d - cat)-dimensional numeric subspace implied by D_NUM(p_CAT).
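
The two emulations come down to how a mixed query is assembled. The sketch below is one possible encoding, with assumed names throughout: a query is a list holding one predicate per attribute, `None` is the categorical wildcard, and a numeric predicate is a `(low, high)` range.

```python
import math

FULL_RANGE = (-math.inf, math.inf)   # numeric predicate that constrains nothing


def as_categorical_query(cat_predicates, num_dims):
    """Query issued while lazy-slice-cover runs: numeric attributes are unconstrained."""
    return list(cat_predicates) + [FULL_RANGE] * num_dims


def as_numeric_query(p_cat, ranges):
    """Query issued while rank-shrink runs: categorical attributes are fixed to p_cat."""
    return list(p_cat) + list(ranges)


# Four attributes as in the example above: MAKE, BODY STYLE, PRICE, MILEAGE.
print(as_categorical_query(["BMW", None], num_dims=2))
print(as_numeric_query(("BMW", "sedan"), [(0, 20000), FULL_RANGE]))
```

The price range in the second call is an arbitrary illustrative value; in hybrid, those ranges would be the rectangles produced by rank-shrink.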

Upper bounds. Denote by Ui (1 ≤ i ≤ cat) the domain size
of the i-th categorical attribute Ai. The following lemma gives
the performance guarantees of hybrid, and establishes the last two
bullets of Theorem 1.

LEMMA 9. When cat > 1, hybrid performs
$\frac{n}{k} \sum_{i=1}^{cat} \min\{U_i, \frac{n}{k}\} + \sum_{i=1}^{cat} U_i + O((d - cat)n/k)$
queries. When cat = 1, the number of queries is U1 + O(dn/k).

PROOF. We focus on only cat > 1 because the same argument
also applies to cat = 1. For cat > 1, the term
$\frac{n}{k} \sum_{i=1}^{cat} \min\{U_i, \frac{n}{k}\} + \sum_{i=1}^{cat} U_i$
is the cost of running lazy-slice-cover, and follows
directly from Lemma 4.

Let S be all the leaf nodes of T_CAT accessed by hybrid, namely, an
instance of rank-shrink was executed on each node u ∈ S. Let $n_u$
be the number of tuples in D having the same value as u on every
categorical attribute (these are the tuples in D_NUM(p_CAT), where
p_CAT is the point in D_CAT corresponding to u). As disjoint bags of
tuples are counted by the $n_u$ of different u, we have
$\sum_{u \in S} n_u \le n$. By Lemma 2, the instance of rank-shrink at
u issues $O((d - cat) n_u / k)$ queries. Hence, the total number of
queries issued by rank-shrink is
$O((d - cat) \sum_{u \in S} n_u / k) = O((d - cat)n/k)$. □

Lower bounds. We conclude this section by explaining why it is
not possible to improve the upper bounds in Lemma 9 by more
than a constant factor in the worst case. In fact, since a mixed data
space is more general than both categorical and numeric spaces,
the lower bounds in Sections 4.1 and 4.2 are still applicable here.
For example, if cat = 0, a mixed D becomes numeric. Hence,
if there was an algorithm guaranteed faster than hybrid by non-
constant times, that algorithm would terminate with o(dm) queries
on the inputs stated in Theorem 3, giving a contradiction. On the
other hand, if cat = d, a mixed D becomes categorical. Similarly,
if there was an algorithm faster than hybrid by non-constant times,
that algorithm would terminate with $o(dU^2)$ queries on the inputs
in Theorem 4, again giving a contradiction. The above discussion
has taken care of cat > 1 or cat = 0, but an analogous argument
applies to the remaining case cat = 1 as well. This,
together with Theorems 3 and 4, establishes Theorem 2.

6. EXPERIMENTS 

In this section, we empirically evaluate the proposed techniques, 
and establish their superiority over alternative solutions. 

Data. Our experiments were based on three real datasets, whose 
attributes, as well as the domain size of each attribute, are given in 
Figure 9 (where num indicates a numeric attribute). Specifically: 

• Yahoo contains 69,768 tuples in a hidden database at
autos.yahoo.com. Each tuple depicts 6 attributes of a vehicle.
This is a mixed dataset (due to the presence of both numeric
and categorical attributes).

• NSF contains 47,816 tuples in a hidden database at 
nsf.gov/awardsearch. Each tuple has 9 attributes of an NSF 
award. This is a categorical dataset. 

• Adult contains 45,222 tuples in a census dataset that can
be downloaded from archive.ics.uci.edu/ml/datasets/adult.
Each tuple describes 14 attributes of a person working in the
US. As with Yahoo, this is also a mixed dataset.

We also extracted a numeric dataset from Adult, by including only
its numeric attributes. The resulting dataset, named Adult-numeric,
therefore has the same cardinality as Adult, but only 6 dimensions.

Figure 9: Attributes and their domain sizes of the datasets deployed

Figure 10: Query cost of numeric algorithms (dataset Adult-numeric)

Figure 11: Query cost of categorical algorithms (dataset NSF)

Recall that every algorithm in Sections 2, 3 and 5 works with 
an ordering of the attributes in the underlying dataset (i.e., which 
attribute is A1, which one is A2, and so on). The ordering is exactly
as shown in Figure 9 (from left to right in each table), and
is the same for all algorithms. Since our experiments require ad- 
justing the value of k, we implemented a local server to run our 
algorithms. Our implementation conforms strictly to the problem 
setup in Section 1.1, so that the cost reported would be equivalent if 
the algorithms were executed on a remote web server. In a dataset, 
each tuple is assigned a random priority, so that if a query overflows,
the k tuples with the highest priorities are always returned.
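
The local server described above can be sketched roughly as follows. This is only an illustration under our own assumptions, not the implementation used in the experiments: the class name `TopKServer`, the `query` method, the predicate-based query form and the `(result, overflow)` return shape are all hypothetical.

```python
import random


class TopKServer:
    """Hypothetical local stand-in for a hidden-database interface that
    returns at most k tuples per query."""

    def __init__(self, tuples, k, seed=0):
        rng = random.Random(seed)
        # Assign each tuple a fixed random priority once, so that repeated
        # overflowing queries always return the same k tuples.
        rows = [(rng.random(), t) for t in tuples]
        rows.sort(key=lambda r: r[0], reverse=True)
        self.rows = [t for _, t in rows]  # tuples in descending priority
        self.k = k
        self.queries_issued = 0  # the cost measure: number of queries

    def query(self, predicate):
        """Return (result, overflow). On overflow, result holds only the
        k matching tuples with the highest priorities."""
        self.queries_issued += 1
        matches = [t for t in self.rows if predicate(t)]
        overflow = len(matches) > self.k
        return matches[: self.k], overflow


# Usage: a toy dataset of 10 tuples queried with k = 4.
server = TopKServer([(i, i % 3) for i in range(10)], k=4)
result, overflow = server.query(lambda t: t[1] == 0)  # 4 matches, no overflow
```

Counting calls to `query` mirrors the cost measure reported in the figures, and changing `k` only requires constructing a new server object.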

Numeric algorithms. We start by studying the performance of nu- 
meric algorithms binary-shrink and rank-shrink in Section 2, using 
dataset Adult-numeric. For each algorithm, Figure 10a shows the 
number of queries it issued to extract Adult-numeric as a function 
of k. Setting k to the median value 256, Figure 10b plots their 
efficiency as the dimensionality d varies from 3 to 6. In this
experiment, for each d ∈ [3, 6], we created a d-dimensional dataset
by taking the d attributes of Adult-numeric that have the highest
numbers of distinct values. Specifically, the attribute with the most
distinct values is FNALWGT, the second is CAP-GAIN, followed by
CAP-LOSS, WRK-HR, AGE and EDU-NUM. Using k = 256 and
fixing d to its original value 6, Figure 10c compares the two
algorithms with respect to the dataset size n. Here, a 20% dataset
corresponds to a random sample of Adult-numeric, obtained by independently
sampling each of its tuples with 20% probability. The datasets of
the other percentages were generated in the same fashion.
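
The two dataset-generation steps above (keeping the d attributes with the most distinct values, and independent per-tuple sampling) might look like the sketch below. The function names and the representation of a dataset as a list of equal-length tuples are assumptions made for illustration only.

```python
import random


def top_d_attributes(rows, d):
    """Indices of the d columns with the most distinct values,
    ordered from most to fewest distinct values."""
    num_cols = len(rows[0])
    distinct = [len({row[c] for row in rows}) for c in range(num_cols)]
    order = sorted(range(num_cols), key=lambda c: distinct[c], reverse=True)
    return order[:d]


def project(rows, cols):
    """Keep only the given columns of every tuple (duplicates are retained)."""
    return [tuple(row[c] for c in cols) for row in rows]


def bernoulli_sample(rows, p, seed=0):
    """Keep each tuple independently with probability p."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


# Usage on a toy dataset: columns 0, 1, 2 have 3, 2, 1 distinct values.
rows = [(1, 0, 5), (2, 0, 5), (3, 1, 5)]
cols = top_d_attributes(rows, 2)         # -> [0, 1]
small = project(rows, cols)              # -> [(1, 0), (2, 0), (3, 1)]
sample = bernoulli_sample(small, p=0.2)  # roughly 20% of the tuples
```

Note that with independent sampling the size of a "20% dataset" is itself random and only equals 0.2n in expectation.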

Rank-shrink consistently outperformed binary-shrink in all 
cases. Furthermore, as predicted by our analysis in Section 2.3 
(particularly, Lemma 2), the cost of rank-shrink was linear in n and
inversely proportional to k (to observe the latter, notice from
Figure 10a that rank-shrink entailed half as many queries each time
k doubled). As a pleasant surprise, Figure 10b demonstrates that
the cost of rank-shrink stayed nearly the same as d increased, even 
though Lemma 2 indicates that the cost should grow linearly in the 
worst case. In fact, by looking at the proof of Lemma 2, one would 
realize that the presence of d in the final time complexity is due 
to 3-way splits (see Figure 2). Such a split happens only if many 
tuples share the same value on a certain attribute. As this is not 
true in Adult-numeric, 3-way splits were seldom performed, which 
explains the phenomenon in Figure 10b. 

Categorical algorithms. We deployed a similar methodology to 
compare the efficiency of algorithms DFS, slice-cover and lazy- 
slice-cover in Section 3. Note that DFS is in fact the baseline ap- 
proach for crawling which was outlined in [15]. The underlying 
dataset was NSF. Figure 11 presents the results for the same set of
experiments as in Figure 10. It is worth pointing out that, in
Figure 11b, a d-dimensional dataset (d ∈ [5, 9]) was generated from
NSF in the way described earlier for Figure 10b (the number of
distinct values on each attribute equals the attribute's domain size,
which can be found in Figure 9).

Figure 12: Cost of the mixed algorithm hybrid

Interestingly, slice-cover, despite being asymptotically optimal,
turned out to exhibit the worst performance. Of course, this does
not contradict our theoretical analysis, because the optimality of
slice-cover is reflected on the hardest datasets (see, for example,
Figure 8). NSF is not such a dataset; hence, slice-cover does not
guarantee better efficiency than a suboptimal solution like DFS. It
is important to note that lazy-slice-cover was the clear winner in all
the experiments (notice that the y-axes of all diagrams in Figure 11
are in log scale). The huge improvement of lazy-slice-cover over
slice-cover confirms the necessity and effectiveness of the heuristic 
discussed in Section 3.2. 

Hybrid algorithms. Having demonstrated the superiority of rank- 
shrink and lazy-slice-cover over their competitors, next we evaluate 
the behavior of their combination: the hybrid algorithm in Sec- 
tion 5. For this purpose, we employed both mixed datasets Yahoo 
and Adult. Figure 12 illustrates the number of queries performed by 
hybrid to crawl each dataset entirely, as k changes from 64 to 1024. 
Note that there is no reported value for Yahoo at k = 64 because it 
has more than 64 identical tuples (i.e., they agree with each other 
on every dimension) - for the reason explained in Section 1.1, no 
algorithm can successfully extract the dataset in full when k = 64. 
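
The condition that rules out Yahoo at k = 64 can be checked directly on a dataset before crawling it. The sketch below is a minimal illustration with hypothetical function names; it tests only the necessary condition stated above, namely that no group of identical tuples is larger than k.

```python
from collections import Counter


def max_duplicate_count(tuples):
    """Largest number of tuples that agree with each other on every dimension."""
    counts = Counter(tuples)
    return max(counts.values(), default=0)


def is_fully_crawlable(tuples, k):
    """False if some group of identical tuples exceeds k: any query that
    covers such a group overflows and cannot be refined any further."""
    return max_duplicate_count(tuples) <= k


# Usage: three identical tuples cannot all be retrieved when k = 2.
data = [(1, "a")] * 3 + [(2, "b")]
print(is_fully_crawlable(data, 2))  # False
print(is_fully_crawlable(data, 3))  # True
```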

In practice, it would be a nice property for a crawling algorithm 
to be able to return the tuples of a hidden database in a progres- 
sive manner. Namely, it should gradually churn out new tuples 
as it runs, instead of outputting most tuples only at the end. This
property allows the crawler to terminate the algorithm at any moment
while still obtaining a number of tuples proportional to
the amount of time that has been spent. Motivated by this, the last
set of experiments examines the progressiveness of hybrid. In
Figure 13 (where k was set to 256), for each dataset, we present the
percentage of the tuples extracted (the y-axis) against the percentage
of the queries issued (the x-axis). For example, a point (20%,
30%) in this figure means that hybrid has discovered 30% of
the tuples in the dataset at the moment when it has issued 20%
of all the queries that eventually need to be performed. We were
delighted to observe linear progressiveness for both datasets, as is 
clear from Figure 13. 
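
A progressiveness curve of this kind could be derived from a crawl log as sketched below. The input format (one entry per query, giving the number of previously unseen tuples that query returned) and the function name are assumptions for illustration, not a description of how Figure 13 was produced.

```python
def progressiveness_curve(new_tuples_per_query):
    """Map a crawl log to points (fraction of queries issued,
    fraction of tuples extracted), one point per query."""
    total_queries = len(new_tuples_per_query)
    total_tuples = sum(new_tuples_per_query)
    if total_queries == 0 or total_tuples == 0:
        return []
    curve, seen = [], 0
    for i, fresh in enumerate(new_tuples_per_query, start=1):
        seen += fresh
        curve.append((i / total_queries, seen / total_tuples))
    return curve


# Usage: four queries that discovered 2, 0, 1 and 1 new tuples.
print(progressiveness_curve([2, 0, 1, 1]))
# [(0.25, 0.5), (0.5, 0.5), (0.75, 0.75), (1.0, 1.0)]
```

Under linear progressiveness the points lie close to the diagonal, i.e., the second coordinate tracks the first.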

7. CONCLUSIONS 

Currently, search engines cannot effectively index hidden 
databases, and are thus unable to direct queries to the relevant data 
in those repositories. With the rapid growth in the amount of such 
hidden data, this problem has severely limited the scope of informa- 
tion accessible to ordinary Internet users. In this paper, we attacked 
an issue that lies at the heart of the problem, namely, how to crawl 
a hidden database in its entirety with the smallest cost. We have 
developed algorithms for solving the problem when the underlying 
dataset has only numeric attributes, only categorical attributes, or 
both. All our algorithms are asymptotically optimal, i.e., none of
them can be improved by more than a constant factor in the worst
case. Our theoretical analysis has also revealed the factors that
determine the hardness of the problem, as well as how much influence
each of those factors has on the hardness.

Figure 13: Output progressiveness of hybrid (k = 256)